High-dimensional mediation analysis for continuous outcome with confounders using overlap weighting method in observational epigenetic study

Background Mediation analysis is a powerful tool to identify factors mediating the causal pathway of exposure to health outcomes. Mediation analysis has been extended to study a large number of potential mediators in high-dimensional data settings. The presence of confounding in observational studies is inevitable. Hence, it’s an essential part of high-dimensional mediation analysis (HDMA) to adjust for the potential confounders. Although the propensity score (PS) related method such as propensity score regression adjustment (PSR) and inverse probability weighting (IPW) has been proposed to tackle this problem, the characteristics with extreme propensity score distribution of the PS-based method would result in the biased estimation. Methods In this article, we integrated the overlapping weighting (OW) technique into HDMA workflow and proposed a concise and powerful high-dimensional mediation analysis procedure consisting of OW confounding adjustment, sure independence screening (SIS), de-biased Lasso penalization, and joint-significance testing underlying the mixture null distribution. We compared the proposed method with the existing method consisting of PS-based confounding adjustment, SIS, minimax concave penalty (MCP) variable selection, and classical joint-significance testing. Results Simulation studies demonstrate the proposed procedure has the best performance in mediator selection and estimation. The proposed procedure yielded the highest true positive rate, acceptable false discovery proportion level, and lower mean square error. In the empirical study based on the GSE117859 dataset in the Gene Expression Omnibus database using the proposed method, we found that smoking history may lead to the estimated natural killer (NK) cell level reduction through the mediation effect of some methylation markers, mainly including methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene. Conclusions The proposed method has higher power, sufficient false discovery rate control, and precise mediation effect estimation. Meanwhile, it is feasible to be implemented with the presence of confounders. Hence, our method is worth considering in HDMA studies. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-024-02254-x.


Background
The analysis of the mediating effect was first proposed by Baron and Kenny (1986) [1] and was broadly applied in many scientific fields, such as psychological, sociological, and biomedical studies [2][3][4].Mediation analysis has become a powerful tool to investigate the underlying mechanism of environmental exposures on health outcomes and identify the factors mediating the effect of exposures on outcomes [5].Currently, analytical methods including the single mediator model [6,7], multiplemediators model [8], and high-dimensional mediation model [9] are proposed and available for researchers in many scientific fields.
With the development of advanced data collection techniques, high-dimensional data has become common in biomedical research.For example, in the epigenetic study, the Illumina Infinium HumanMethylation450 BeadChip array platform allows to measure the DNA methylation levels of roughly 480 K probes [10] and generates high dimensional data.Focusing on practical research, smoking affects lung function, and some DNA methylation sites may mediate the effect of smoking on lung function [11,12].To identify the significant mediators (CpG sites) between smoking and lung function, we can conduct mediation analysis in the collected high-dimensional data [9,13,14].Obviously, this method can be used to identify the methylation sites mediating the association between environmental factors other than smoking and other health outcomes including some physical signs and diseases.
However, there are also some issues in high dimensional mediation analysis (HDMA), such as the curse of dimensionality, the false positive rate inflation caused by multiplicity and the confounding existing in observational research.To overcome these issues, scholars have proposed a series of statistical methods.Zhang et al. [9].proposed the HIMA model consisting of variable screening based on sure independence screening (SIS), variable selection techniques based on minimax concave penalty (MCP) estimation and joint significance test.HIMA extends the multiple mediator framework to the highdimensional setting by incorporating variable screening and variable selection techniques into multiple mediation analysis.The following high-dimensional mediation analysis methods also employ the generic procedure [13][14][15][16], which reduces dimensionality from high to moderate or low scale and then conducts multiple mediation test.For example, the HIMA2 procedure proposed by Perera et al. [17], which employs the SIS method based on the indirect effect of every single mediator and conducts debiased Lasso to obtain more accurate estimates, then utilizes the multiple-testing procedure proposed by James et al. [18] to control the false discovery rate.Moreover, to adjust the confounders of observational epigenetic studies, researchers tried to integrate propensity score (PS) into the high-dimensional mediation model by weighting or considering it as a covariate [14,16], except for the classic regression adjustment.
Although many works have been made to tackle these problems, there are still some issues remaining in the dimensionality reduction and adjustment for confounders.For high dimensional mediation analysis, the previous studies don't take confounders into account, just consider them as covariates [15,19], such as HDMA, HIMA, and HIMA2 [5,9,17].As is known to all, the multivariable model cannot adequately account for confounding effects in the presence of a large number of confounders [20].If we only control confounding during the mediation test, but not in the dimension reduction stage, then a biased variable selection result may be obtained [14].Thus, it is necessary to adjust confounders to improve the performance of variable selection.
To address this issue, researchers have adopted the PS-based method including PS regression adjustment (termed PSR) and classical PS weighting (also called inverse probability weighting, IPW) to adjust confounding during both stages [14].However, the adjustment for confounders using the IPW based on PS still faces the issue of extreme weights caused by extreme PS distribution [21,22].To address the issue of extreme PS distribution, Li et al. [23].proposed the overlap weighting (OW) method, which emphasizes individuals with the most overlap in their observed characteristics and is beneficial to provide a consistent estimator of the effect of exposure on outcome in the presence of extreme PS tails.OW belongs to the weighting confounding adjustment method based on PS and is gaining more popularity because of excellent statistical properties [24,25].However, the above OW method is only applied to traditional epidemic analysis, which needs to be extended to mediation analysis and high-dimensional data setting.Besides, most of the existing methods all hold the independent assumption between potential mediators, which is hard to ensure in high dimensional epigenetic data analysis [5,9,[13][14][15]18].
In this article, we incorporated the OW method into HIMA [9] and HIMA2 [17] models, respectively.In order to develop the accuracy of the screening of potential mediators, we modified the framework of variable screening in the original HIMA2 procedure.Eventually, we proposed the OW-based modified HIMA2 (mHIMA2) procedure for HDMA.We evaluated the performance of the proposed procedure and the existing models through simulation studies.All the above evaluations are based on the simulation study and real data application.
The rest of the article is structured as follows.In the next section, we introduced the notions, assumptions, models, and the procedure of adjustment for confounders in the high-dimensional mediation analysis model.Then, we conducted the Monte Carlo simulation study to evaluate the performance of various methods of confounding adjustment and two different mediation test approaches.Additionally, we applied the proposed method to the dataset GSE117859 in the Gene Expression Omnibus (GEO) databases and identified some DNA methylation markers that mediate the effect of smoking on the estimated natural killer (NK) cell level.Finally, we concluded the advantages and limitations of this study.

Model definitions
Our high-dimensional mediation model is shown in Fig. 1.Let X be the exposure variable, where X = 1 represents the exposed group and X = 0 repre- sents the controlled group.Denote the outcome as Y, here we mainly focus on continuous outcome.Let T be the q-dimension baseline confounders which influence the relation of exposuremediator, mediator-outcome, and exposure-outcome.For individual i , i = 1,2, • • • , n , we have the high-dimensional mediation models as follows: (1) (2)

Assumptions
To ensure the identification of path-specific mediating effects, some assumptions need to be held as below.These assumptions were proposed referring to necessary condition required for high-dimensional mediation analysis suggested in published studies [8,15,17,19,26,27]: A1: There is no causal association between mediators.This means the proposed model contains only parallel mediators.
A2: Sequential ignorability.That consists of four assumptions listed below: (A2.1)There are no unmeasured confounders between the exposure and the outcome; (A2.2) There are no unmeasured confounders between the mediators and the outcome; (A2.3)There are no unmeasured confounders between the exposure and the mediators; (A2.4)There is no exposure-induced confounding between the mediators and the outcome.
A3: Stable unit treatment value assumption (SUTVA) [28,29] for both the mediators and the outcome.That is to say, there is no interference between individuals.
A4: Consistency for the mediators and the outcome.That is to say, there are no measurement errors in the mediators.
A5: Positivity assumption [30].Every individual has some positive probability of being exposed to the factor of interest.

Proposed Procedure
We improved the HIMA procedure proposed by Zhang et al. (2016) [9] and the HIMA2 procedure proposed by Perera et al. (2022) [17] under the condition of adjusting confounders in observational data.
In this study, we developed two processes to conduct the confounding-controlled high-dimensional mediation analysis.The detailed procedure is described in the following text.

Step 1: PS-based methods for adjusting confounders
Since there are always some baseline confounders in observational data, we integrate propensity score (PS) into mediators (and/or outcome) models to reduce the selection bias and acquire as accurate estimates of the mediation effect as possible.Due to the PS approaches allowing the inclusion of a large scale of confounders, PS is widely used in observational research.
PS is defined as the conditional probability that a study individual with baseline covariates would be exposed to certain study factors of interest [31]: PS can be estimated by classic multivariable statistical methods such as logistic regression [32] or by machine learning methods such as random forest (RF) and generalized boosted model (GBM) [33,34].In practice, logistic regression is the most commonly used.The PS of i th indi- vidual π i = P(X i = 1|C 1i , • • • , C li ) can be expressed as follows: T represents the effect of the confounders on the exposure.Then we can adopt some PS-based techniques to adjust confounders such as matching [35], stratification [36], regression [31], and weighting [37].Here, we focus on PS regression (PSR) and PS weighting [14] (PSW, also called IPW short for inverse probability weighting) techniques to adjust potential confounders between exposure, mediators and outcome.
PSR approach incorporates PS as a covariate into the original regression model to adjust for the probability of being exposed to study factors and to reduce confounding [32].That is similar to taking all confounders as covariates in a classical regression approach which usually uses the linear regression model for continuous outcomes and the logistic regression model for binary outcomes [38].For the PSR approach, we can estimate the effect through the models below: The PSW approach constructs the inverse probability weights by taking the reciprocal of PS.For binary exposure, the weight of the exposed group X = 1 is given as 1 PS , and that of the controlled group X = 0 as 1 1−PS .For i th individual: Then, we can estimate the coefficients of X in pathways X → M and M → Y by weighted estimation: where α k,ipw and γ ipw are the weighted estimation accord- ing to the ipw weight vector.However, the IPW often faces extreme PSs issue which may lead to extreme weights and result in biased estimates and excessive variance [23,24].
The overlap weighting (OW) approach was proposed to address the issue of extreme PSs [23].The overlap weight is given as 1 − PS for the group X = 1 and PS for the group X = 0 .Note that, individuals with PS of 0.5 make the largest contribution to the effect estimate, and individuals with PS close to 0 and 1 make the smallest con- tribution.OW is likely to be beneficial in the presence of extreme tail weights [23,39].For individual i: Then, the effect estimation of OW is similar to that of the PSW procedure: In the same way, α k,ow and γ ow are the weighted estima- tion using ow weight vector.

Step 2: Confounding-controlled SIS approach for dimensionality reduction
The SIS procedure is a general technique to reduce accurately high dimensions to below sample size [40].We adopt the SIS method to reduce dimension p from ultra- high dimension to moderate scale d = 2n log(n) [9,15].In this study, we considered two preliminary screening strategies as described in HIMA [9] and HIMA2 [17], based on the effects of M on Y ( β k ) and the indi- rect effect |α k β k | respectively.Because the indirect effects can be both positive and negative effects, to address the influence of the signs of the estimated indirect effects, the HIMA2 approach uses the absolute values of the indirect effect to obtain the size of the effect estimate regardless of the direction.This approach ensures that mediators with large effect size can be selected.
Due to the lack of screening accuracy in SIS based on indirect effects in the presence of confounders, we conducted the SIS screening based on the effects on the path M → Y controlling confounding effects using the OW approach. (4) (5) In simulation, we found that it is hard to select the true mediators based on |α k β k | in the pres- ence of confounding factors as applied in the original HIMA2 approach.So, we modified the frame of the HIMA2 method and both adopt SIS based on the effects on the path M → Y β k in the preliminary screening to select the subset of potential mediators Noticing that we need to adopt a two-step weighting method [14] to estimate β k for the PSW and OW methods.
First, γ k,w can be obtained from the following sub-model: where γ k,w is the estimator of γ k,ow or γ k,ipw for each M k .In addition, the residual e k can be derived: Then β k can be estimated by regressing e k on M k with- out weighting.Through the above SIS procedure, we can identify the important mediators and achieve the goal of dimensionality reduction.

Step 3: Penalized estimation
According to the HIMA procedure, after the preliminary selection of candidate mediators, further variable selection can be accomplished by the penalized estimation method.Here, we adopt the MCP [41] rather than other penalty functions, since the MCP approach has the oracle property which can select the correct model with probability tending to 1 as n → ∞ [15,41,42].
For the d-dimensional subset M SIS , we employed the MCP-penalized estimation to further select significant mediators set M MCP = {M k : β k � = 0, M k ∈ M SIS } , MCP penalty function can be defined as below: where > 0 is the regularization parameter which can be selected by AIC or BIC, and δ > 0 is the tuning param- eter which determines the concavity of MCP.The MCP procedure can be implemented through the R package ncvreg [43].Through MCP penalty estimation, we filtered out the mediators with too weak effects by combining SIS and MCP procedures and then acquired the small number of mediators that needed to be tested.That will help to obtain more accurate effect estimates.
Following the original HIMA2 procedure, the penalized estimation adopts the de-biased Lasso method to get the estimator β k and standard error σ β k .The sub-model of the de-biased Lasso method can be described below: where β SIS denote the effects of M k ∈ M SIS on Y .The cor- responding P-values P β k are given as: where �(.) is the cumulative distribution function of standard normal distribution N (0,1) .The de-biased Lasso method can be implemented with the R package hdi.

Step 4: PS-based multiple mediation test
After MCP-based penalized estimation, we use the Joint significance test [3,44] (termed JS-uniform) to test the mediation effect of M k ∈ M MCP .The Joint sig- nificance test considers the M k as a true mediator when α k and β k is significant simultaneously.Here, α k can be estimated through different confounding adjustment methods as shown in Eqs. 1, 3, 4, and 5. β k can be obtained using the linear regression with considering all confounders as covariates or only including PS (summary of all confounders) as a covariate.
In other words, that is based on the P-values for testing the path-specific effects H 0 : α k = 0 or H 0 : β k = 0 .The raw P-value for the joint significance test [3] is defined below: P raw,k = max P raw,α k , P raw,β k , #where P raw,α k and P raw,β k are the P-values for testing H 0 : α k = 0 and H 0 : β k = 0 .P raw,α k and P raw,β k can be obtained from the mediator model (e.g.Equations 1, 3, 4, and 5) and outcome model (Eq.2), respectively.
For the multiplicity (Type I error inflation) issue in multiple mediation testing, we adopted the Benjamini-Hochberg (BH) method [45,46] to acquire the adjusted p-values as below, where q is the number of potential mediators in the set M MCP , and r k is the location number of P raw,k when all the P-values P raw,k are sorted ascending.
However, the Joint significance test assumes P raw,k follows a uniform null distribution.Although P α k and P β k are each uniformly distributed, their maximum may not.Therefore, the Joint significance test results in a valid but overly conservative test with lower power [13,17,47].
Hence, we adopt the PS-based joint significance with mixture null distribution method [18] (termed JS-mixture) approach to conduct multiple mediation test after de-biased Lasso penalized estimation [17,48] referring to the classical HIMA2 procedure.The PS-based JS-mixture approach adopts a 3-component mixture distribution as below: The estimated pointwise FDR for testing mediation can be computed as: where t ∈ [0,1] , V 00 (t), V 01 (t), V 10 (t) denoting the numbers of the three types of false positives and The V 00 (t), V 01 (t), V 10 (t) and FDR(t) can be obtained using the R package HDMT.
We set the significance level of 0.05 for all the tests.The detailed processes of the proposed method are summarized in Fig. 2.

Simulation design
In this section, we conducted the simulation studies to evaluate the performance of the proposed method.The implementation of the simulation was based on R (version 4.3.0,R Foundation for Statistical Computing, Vienna, Austria) and RStudio (version 2023.9.0.463,RStudio: Integrated Development Environment for R, Boston, MA).The setting of simulation parameters was based on the published studies [9,14,16].The number , of replications in simulation study was set to be 500 for each combination of parameter setting referring to the replication times settings in published methodogical studies [9, 14-17, 19, 49].
The model structure is shown in Fig. 1.We consider affecting the rela- tionship of X , M , Y , in which continuous confound- ers C 1 − C 4 follow a multivariate normal distribution N (µ, �) with a mean vector µ = (0,0, 0,0) T and a covari- ance matrix : The last four binary confounders C 5 − C 8 are indepen- dently generated from the Binary distribution B(n, 0.3) , where n is the sample size.
Then exposure X can be generated from Binary distribu- tion B(n, P c ) , where n is the sample size, P c = 1/ 1 + e − θ T C , and Mediators M and the outcome variable Y are gener- ated according to Eqs. 1 and 2, respectively.For simplicity, we set all the effects of C on M to be the same.Let We set the first four potential mediators M 1 − M 4 as the true significant mediators in this study.Let 0.4,0.4,0.5,0.5,0.5,0.5,0, 0,• • • , 0) ; 0.4,0.5,0.5,0.6,0, 0,0.5,0.5,0,• • • , 0) .The elements of both α and β are equal to zero except for the first eight elements, and the first four are the significant mediators.The mediation effect size of the true mediators M 1 − M 4 is αβ 1−4 = (0.16,0.2,0.25,0.3).
To evaluate the impacts of sample size and potential mediators dimension, we set two sample size levels n = 300, 500 , and two dimension levels p=1000,10000.
In addition, we take the correlation between mediators into account in the condition of p=1000 dimension.We simulate the strong correlation between mediators by generating the error terms e k from N (0, � e ) , where � e = ρ |k−k ′ | k,k ′ .It means the correlation between two mediators will decrease as the absolute difference in mediators' subscript k − k ′ increases.We set four cor- relation levels ρ = 0, 0.25, 0.5,0.75 with dimension p =1000 and sample size n = 300, 500 .In the simulation Fig. 2 The overall workflow for high-dimensional mediation analysis under the adjusting for confounders condition setting ρ = 0, 0.25, 0.5,0.75 , the corresponding Pearson correlation coefficients between two adjacent mediators are around 0.4, 0.5, 0.7, and 0.8, respectively.We evaluated the performance of the mHIMA2 and PS-based HIMA by conducting 500 replications of simulated data sets for each scenario [9, 14-17, 19, 49].

Simulation results
Simulation results are presented in Tables 1 and 2. Evaluation of the performance of mediator selection of the proposed approach is shown in Table 1 by measuring the true positive rate (TPR) and false discovery proportion (FDP) of selection after the significance test for mediation effects.The mediators have higher TPR as the indirect effect increases (i.e., larger mediation effect, higher detection rate).
As presented in Table 1.Under most settings, the mHIMA2 mediation test approach has a higher TPR than PS-based HIMA while a higher FDP at the same time.Overall, the mHIMA2 is more powerful than the  1, for the mHIMA2 mediation test approach, TPR is ranked as OW > IPW > PSR > RA, and FDP is not more than 0.1 and gradually decreases to close to 0.05 as the sample size increases.Among all models, the mHIMA2 mediation test approach with OW adjustment has the highest power and acceptable false positive level.When using the PS-based HIMA mediation test approach, TPR is ranked consistently as RA > PSR > OW > IPW, and all four models also keep FDP at an extremely low level.
Table 2 presents the estimation of mediation effects with the mean and mean square error (MSE).The estimators approach the true values as the mediation effect increases.All models tend to be more accurate as n gets larger and p gets smaller.Overall, the mHIMA2 media- tion test approach has a smaller MSE than the PS-based HIMA approach in most cases.RA adjustment has a higher MSE than other adjustment methods especially when facing the large mediation effect, OW adjustment has the lower MSE among the four adjustment methods.
As shown in Table 2, similarly, the mHIMA2 approach with OW adjustment has the smallest MSE among all models.Moreover, similar results can be seen in the different strong correlation settings in Table S1-S8 in the supplementary file.The mHIMA2 methods have lower MSE (i.e. more precise estimation) and apparently higher TPR.That means the de-biased Lasso technique in mHIMA2 methods performs better when handling the moderate correlation between mediators.However, the FDP of all models slightly increases as the correlation between mediators increases.When correlation among the mediators is strong (for example, r>0.7), all models suffer in terms of increased MSE.

Data application
Smoking is an important environmental factor affecting the immune system and blood cell composition [50,51].Previous studies have demonstrated smokers had lower natural killer (NK) cell counts and activity [50,51].Smoking has also been found to be associated with DNA methylation levels [52].Meanwhile, DNA methylation levels have also been found to be associated with associated with human NK cell activation [53,54].Therefore, DNA methylation may mediate the association between smoking and NK cell level.So we implemented the proposed highdimensional mediation analysis methods to identify the specific functional CpG sites that may mediate the relationship between smoking and the estimated NK cell level.
Here we apply our method to the GSE117859 dataset obtained from the Gene Expression Omnibus (GEO) database.The aim of the study in which GSE117859 was originally measured is to explore the smoking-associated DNA methylation features linked to AIDS outcomes in the HIV-positive population [55].The blood samples from the Veteran Aging Cohort Study (VACS) were collected in that study.The HumanMethylation450 BeadChip platform was used to measure the DNA methylation levels.
In total 608 samplesand 485,577 probes were included in the dataset.Clinical information such as age, sex, race, smoking history, adherence of antiretroviral therapy (ART), estimated CD4 T cells, estimated CD8 T cells, and estimated NK cells were collected.The estimated CD4/CD8/NK were obtained using a methylation-based cell type deconvolution algorithm proposed by Housman et al. [56].To some extent, the estimated CD4 and CD8 levels can represent AIDS severity.
Smoking status was collected based on self-report.All included patients were classified into the smoker and the non-smoker groups according to their reported smoking history.After removing the individuals without available clinical information and DNAm sites with missing values, a total of 587 samples and 485,503 probes were included in the analysis.
We adjusted the potential confounders including age, race, adherence of antiretroviral therapy, estimated CD4 T cells, and estimated CD8 T cells.Demographic and clinical variables included in our analysis are presented in Table 3.
The analysis results using the proposed mHIMA2 method are presented in Table 4. Here, we mainly presented the CpGs mediators with a total effect proportion greater than 5%.Due to the limitation of text content, we didn't present the whole summary results of the PS-based HIMA method, but that can be seen in Table S9 in the supplementary file.
As shown in Table 4, we identified two methylation sites cg13917614 in CNP gene and cg16893868 in LILRA2 gene by most of mHIMA2 based methods.The similar result can be seen in Table S9 in the supplementary file.The existing studies have already demonstrated the site cg13917614 is associated with smoking [52,57].Although we don't find direct evidence that the CNP gene is associated with immune function based on the existing literature, relevant studies showed a link between CNP and inflammatory responses in which the mechanism remains further study [58,59].
The encoded protein of the LILRA2 gene can suppress innate immune response [60,61].The results reveal that smoking will promote the demethylation of cg16893868, leading to an increase in gene LILRA2 expression and ultimately reducing the estimated NK cell level.It has been found that the remaining CpG sites cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140 are all associated with smoking [11,52,[62][63][64].Further insights into the discovered CpG mediators in genomewide epigenetic studies will be meaningful.

Discussion
The causal relationship obtained in high-dimensional mediation analysis usually depends on no-confounding assumption.However, confounding is almost inevitable in observational studies owing to the lack of randomization of the baseline covariates in practice.Previous studies show the utilization of PS method such as PS-adjustment and IPW in high-dimensional mediation analysis, but those face the issue of extreme PS distribution.
In this article, we integrated OW approach into the high-dimensional mediation model, which can address extreme PS distribution and better adjust for confounding.Finally, we developed a high-dimensional mediation analysis workflow consisting of OW confounding adjustment, SIS, de-biased Lasso penalization for potential mediator screening, and the high-dimensional mediation test underlying the mixture null distribution of P-values.
Simulation results indicate that the mHIMA2 with OW approach presented in this study performs best among all the compared models with the highest TPR, acceptable FDP level, and the smallest MSE in mediating effect estimation.In addition, the mHIMA2 embedded de-biased Lasso method performs better when moderate correlations between mediators exist.
Simulation study also suggestedthe proposed method would perform better when the sample size was increased.This result suggests that when the proposed method is used for the analysis of mediating effects on real data, a sufficient sample size should also be ensured.Such a feature is also consistent with other existing methods [5,9,14,17,19,49].Furthermore, the dimensionality of potential mediators has little effect on the performance of the proposed method.
In most of the previous studies [5,9,13,17], it didn't take confounding adjustment into account in the SIS process.However, we adopted the PS-based method to adjust confounding, thus improving the accuracy of mediators screening.Moreover, it has been assumed that mediators are linearly independent of each other, but such an assumption is often not strictly valid in real data.The violation of the mediators' independence assumption often affects the accuracy of mediators selection and precision of mediating effect estimation.The proposed method can effectively deal with this issue which can tolerate the correlation between the mediators and ensure the robustness of mediators selection, multiple mediation testing, and mediating effect estimation.Similar to other two-step approaches, the error of the first model may be introduced and cumulated in the second step, because the first-step can not quarantee 100% correctness.To avoid this, we set a relatively loose screening criterion with d = 2n/log(n) to select the top d largest effect mediators [15][16][17]49] in the first step to control false negative while avoiding the increase of false positive error according to the application recommendation of SIS approach.Though the errors cannot be totally avoid, this can reduce the error in the preliminary screening of mediators and prevent serious error cumulation in the second step to some extend.As shown in the simulation, the proposed two-step model performed well.Besides, previous published studies also have demonstrated the error cumulation issue in two-step models can be controlled well in the similar way as we did, and well not cause serious bias in the final results [14,[65][66][67][68][69][70].
Meanwhile, we applied the proposed method to the dataset GSE117859 obtained from the GEO databases and identified several significant DNAm mediators, including the sites cg13917614, cg16893868, cg20460771, cg03164561, cg03605454, cg09529165, and cg01500140.Among them, site cg16893868 in LILRA2 gene has been demonstrated to be associated with smoking and immune function [60,61].That indicates that the proposed method can identify reliable mediators in empirical data analysis.
The presence of confounding in observation studies always is a major challenge to obtaining causal relationships.Currently, most genetic studies are based on observational research without randomization of baseline characteristics.Particularly, the high-dimensional mediation analysis always faces some issues, such as the accuracy of the high-dimensional mediation selection and the low power of multiple mediation test [13,14,17,18].Although the utilization of PSR and IPW offers a solution of confounding adjustment in classical HDMA workflow, it still faces the issue of extreme PS distribution.
The proposed OW-based method can provide a more precise and stable mediating effect estimation.However, the misspecification of the outcome model and PS model can not be avoid in practice.Hence, the doubly robust methods may be desirable to be applied in HDMA workflow in future study.Even if the JS-mixture method was proposed to improve the power of multiple mediation testing, other more powerful test methods still are appealing in large-scale genome-wide epigenetic studies [13,18].Conducting further simulation and methodology studies to compare different powerful test methods may provide useful reference for future studies.It should also be noticed that the existence of unmeasured confounding is out of the scope of this paper.Previews published studies have provided serval applicable methods to deal with this issue [49,71].

Conclusion
Overall, the mHIMA2 with OW adjustment has sufficient power in selecting potential true mediators and obtaining precise estimation for mediation effects.It can be recommended in practical high-dimensional mediation analysis, especially in epigenetic study.

Table 1
TPR and FDP for the four true mediators (M1-M4) a MT methods denote two different mediation test approaches, including HIMA Zhang et al. and modified HIMA2 Perera et al. (termed mHIMA2) RA denotes regression adjustment.PSR denotes propensity score regression adjustment.IPW denotes inverse probability weighting.OW denotes overlapping weighting b CONF methods denote different confounding adjustment methods *

Table 2
Estimation results of mediation effects, expressed as Mean (MSE)a MT methods denote two different mediation test approaches, including HIMA and modified HIMA2 (termed mHIMA2).RA denotes regression adjustment.PSR denotes propensity score regression adjustment.IPW denotes inverse probability weighting.OW denotes overlapping weighting b CONF methods denote different confounding adjustment methods *

Table 3
Baseline characteristics of the HIV-positive patients included in the analysisa ART adherence of antiretroviral therapy

Table 4
Summary of the selected CpGs mediators with a %TE > 5 by the mHIMA2 models a Method denotes the combination of the mHIMA2 approach and different confounding adjustment methods.Such as mHIMA2-OW denotes the mHIMA2 method